{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b35d2e8f",
   "metadata": {},
   "source": [
    "# Comp1-1传统组学特征提取\n",
    "\n",
    "主要适配于传统组学的建模和刻画。典型的应用场景探究rad_score最最终临床诊断的作用。\n",
    "\n",
    "数据的一般形式为(具体文件,文件夹名可以不同)：\n",
    "1. `images`文件夹，存放研究对象所有的CT、MRI等数据。\n",
    "2. `masks`文件夹, 存放手工（Manuelly）勾画的ROI区域。与images文件夹的文件意义对应。\n",
    "\n",
    "## Onekey步骤\n",
    "\n",
    "1. 数据校验，检查数据格式是否正确。\n",
    "2. 组学特征提取，如果第一步检查数据通过，则提取对应数据的特征。\n",
    "5. 查看一些统计信息，检查数据时候存在异常点。\n",
    "6. 正则化，将数据变化到服从 N~(0, 1)。\n",
    "7. 通过相关系数，例如spearman、person等筛选出特征。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb054795",
   "metadata": {},
   "source": [
    "## 一、数据校验\n",
    "首先需要检查诊断数据，如果显示`检查通过！`择可以正常运行之后的，否则请根据提示调整数据。\n",
    "\n",
    "**注意**：这里要求images和masks文件夹中的文件名必须一一对应。e.g. `1.nii.gz`为images中的一个文件，在masks文件夹必须也存在一个`1.nii.gz`文件。\n",
    "\n",
    "当然也可以使用自定义的函数，获取解析数据。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a4cbe42",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ['KMP_DUPLICATE_LIB_OK'] = 'TRUE'\n",
    "from pathlib import Path\n",
    "from onekey_algo.custom.components.Radiology import diagnose_3d_image_mask_settings, get_image_mask_from_dir\n",
    "from onekey_algo import OnekeyDS as okds\n",
    "\n",
    "os.makedirs('img', exist_ok=True)\n",
    "# 设置数据目录\n",
    "# mydir = r'你自己数据的路径'\n",
    "mydir = okds.ct\n",
    "\n",
    "# 生成images和masks对，一对一的关系。也可以自定义替换。\n",
    "images, masks = get_image_mask_from_dir(mydir, images='images', masks='masks')\n",
    "\n",
    "# 自定义获取images和masks数据的方法，下面的例子为，每个样本一个文件夹，图像是以im.nii结尾，mask是以seg.nii结尾。\n",
    "# def get_images_mask(mydir):\n",
    "#     images = []\n",
    "#     masks = []\n",
    "#     for root, dirs, files in os.walk(mydir):\n",
    "#         for f in files:\n",
    "#             if f.endswith('im.nii'):\n",
    "#                 images.append(os.path.join(root, f))\n",
    "#             if f.endswith('seg.nii'):\n",
    "#                 masks.append(os.path.join(root, f))\n",
    "#     return images, masks\n",
    "# images, masks = get_images_mask(mydir)\n",
    "\n",
    "diagnose_3d_image_mask_settings(images, masks)\n",
    "print(f'获取到{len(images)}个样本。')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e28d5875",
   "metadata": {},
   "source": [
    "# 传统组学特征\n",
    "\n",
    "使用pyradiomics提取传统组学特征，正常这块不需要修改，下面是具体的Onekey封装的接口。\n",
    "\n",
    "```python\n",
    "def extract(self, images: Union[str, List[str]], \n",
    "            masks: Union[str, List[str]], labels: Union[int, List[int]] = 1, settings=None)\n",
    "\"\"\"\n",
    "    * images: List结构，待提取的图像列表。\n",
    "    * masks: List结构，待提取的图像对应的mask，与Images必须一一对应。\n",
    "    * labels: 提取标注为什么标签的特征。默认为提取label=1的。\n",
    "    * settings: 其他提取特征的参数。默认为None。\n",
    "\n",
    "\"\"\"\n",
    "```\n",
    "\n",
    "```python\n",
    "def get_label_data_frame(self, label: int = 1, column_names=None, images='images', masks='labels')\n",
    "\"\"\"\n",
    "    * label: 获取对应label的特征。\n",
    "    * columns_names: 默认为None，使用程序设定的列名即可。\n",
    "\"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c28f8c9f",
   "metadata": {},
   "outputs": [],
   "source": [
    "from onekey_algo.custom.Manager import onekey_show\n",
    "# 特征提取视频\n",
    "onekey_show('传统组学任务|特征提取')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79b88fd9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "import pandas as pd\n",
    " \n",
    "warnings.filterwarnings(\"ignore\")\n",
    "\n",
    "from onekey_algo.custom.components.Radiology import ConventionalRadiomics\n",
    "\n",
    "# 如果要自定义一些特征提取方式，可以使用param_file。\n",
    "# param_file = r'./custom_settings/exampleCT.yaml'\n",
    "param_file = None\n",
    "radiomics = ConventionalRadiomics(param_file, correctMask= True)\n",
    "radiomics.extract(images, masks)\n",
    "rad_data = radiomics.get_label_data_frame(label=1)\n",
    "rad_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "268fd11a",
   "metadata": {},
   "source": [
    "# 特征维度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1de1e4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "rad_data.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f8b1f6e",
   "metadata": {},
   "source": [
    "## 获取到数据的统计信息\n",
    "\n",
    "1. count，统计样本个数。\n",
    "2. mean、std, 对应特征的均值、方差\n",
    "3. min, 25%, 50%, 75%, max，对应特征的最小值，25,50,75分位数，最大值。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa5c306d",
   "metadata": {},
   "outputs": [],
   "source": [
    "rad_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0a13a21",
   "metadata": {},
   "source": [
    "## Z-score\n",
    "\n",
    "`normalize_df` 为onekey中正则化的API，将数据变化到0均值1方差。正则化的方法为\n",
    "\n",
    "$column = \\frac{column - mean}{std}$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "33fe3d76",
   "metadata": {},
   "outputs": [],
   "source": [
    "from onekey_algo.custom.components.comp1 import normalize_df\n",
    "data = normalize_df(rad_data, not_norm=['ID'])\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cd361f98",
   "metadata": {},
   "source": [
    "### 相关系数\n",
    "\n",
    "计算相关系数的方法有3种可供选择\n",
    "1. pearson （皮尔逊相关系数）: standard correlation coefficient\n",
    "\n",
    "2. kendall (肯德尔相关性系数) : Kendall Tau correlation coefficient\n",
    "\n",
    "3. spearman (斯皮尔曼相关性系数): Spearman rank correlation\n",
    "\n",
    "三种相关系数参考：https://blog.csdn.net/zmqsdu9001/article/details/82840332"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b676c43",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pearson_corr = data.corr('pearson')\n",
    "# kendall_corr = data.corr('kendall')\n",
    "spearman_corr = data.corr('spearman')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c74ed30",
   "metadata": {},
   "source": [
    "### 特征可视化\n",
    "\n",
    "特征热图，不同特征之间的两两相关性。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "611c9734",
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "from onekey_algo.custom.components.comp1 import draw_matrix\n",
    "plt.figure(figsize=(100.0, 80.0))\n",
    "\n",
    "# 选择可视化的相关系数\n",
    "draw_matrix(spearman_corr, annot=True, cmap='YlGnBu', cbar=False)\n",
    "plt.savefig(f'img/feature_corr.svg', bbox_inches = 'tight')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8c1788a2",
   "metadata": {},
   "source": [
    "### 聚类分析\n",
    "\n",
    "通过修改变量名，可以可视化不同相关系数下的相聚类分析矩阵。\n",
    "\n",
    "注意：当特征特别多的时候（大于100），尽量不要可视化，否则运行时间会特别长。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3662e847",
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "pp = sns.clustermap(spearman_corr, linewidths=.5, figsize=(50.0, 40.0), cmap='YlGnBu')\n",
    "plt.setp(pp.ax_heatmap.get_yticklabels(), rotation=0)\n",
    "plt.savefig(f'img/feature_cluster.svg', bbox_inches = 'tight')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fcb5c23",
   "metadata": {},
   "source": [
    "### 特征筛选\n",
    "\n",
    "根据相关系数，对于相关性比较高的特征（一般文献取corr>0.9），两者保留其一。\n",
    "\n",
    "```python\n",
    "def select_feature(corr, threshold: float = 0.9, keep: int = 1, topn=10, verbose=False):\n",
    "    \"\"\"\n",
    "    * corr, 相关系数矩阵。\n",
    "    * threshold，筛选的相关系数的阈值，大于阈值的两者保留其一（可以根据keep修改，可以是其二...）。默认阈值为0.9\n",
    "    * keep，可以选择大于相关系数，保留几个，默认只保留一个。\n",
    "    * topn, 每次去掉多少重复特征。\n",
    "    * verbose，是否打印日志\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4fe3df7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from onekey_algo.custom.components.comp1 import select_feature\n",
    "sel_feature = select_feature(spearman_corr, threshold=0.9, topn=10, verbose=False)\n",
    "sel_feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb92f704",
   "metadata": {},
   "outputs": [],
   "source": [
    "data[['ID'] + sel_feature].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e88423c1",
   "metadata": {},
   "source": [
    "### 过滤特征\n",
    "\n",
    "通过`sel_feature`过滤出筛选出来的特征。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86e09b3f",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sel_data = data[['ID'] + sel_feature]\n",
    "sel_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96ae6a46",
   "metadata": {},
   "source": [
    "### 保存特征\n",
    "\n",
    "保存CSV特征，为了后续做多组学融合使用。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5012998",
   "metadata": {},
   "outputs": [],
   "source": [
    "sel_data.to_csv('rad_features.csv', header=True, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eda98f29",
   "metadata": {},
   "source": [
    "### 特征读取\n",
    "\n",
    "为了和后续其他的组件串联，需要特征读取模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "809037a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas\n",
    "read_data = pandas.read_csv('rad_features.csv', header=0)\n",
    "read_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "916fbfda",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
